Extracting the Main Content from HTML Documents

Author

  • Samuel Louvan
Abstract

A modern web document typically contains many kinds of information. Besides the main content, which conveys the primary information, a web document also contains noisy content such as advertisements, headers, footers, decorations, copyright information, and navigation menus. The presence of noisy content may affect the performance of applications such as commercial search engines, web crawlers, and web miners. Extracting the main content from web documents and removing the noisy content is therefore an important process. In this paper we present an approach for extracting the main content from web documents that combines classification tasks and heuristic approaches.


Similar resources

Optimized Content Extraction from web pages using Composite Approaches

The amount of information available on the web today is tremendous, and it brings greater challenges. Content extraction identifies the main content and removes the clutter from web pages. The main difficulties in extracting content from a web page are the newer architecture of web pages and the diversity of their structure. Optimized content extraction from HTML documents using collective approac...


Extracting the Main Content of Web Documents Based on Character Encoding and a Naive Smoothing Method

This chapter presents R2L, DANA and DANAg, a family of novel algorithms for extracting the main content (MC) of web documents. The main concept behind R2L, which also provided the initial idea and motivation for the other two algorithms, is to exploit particularities of Right-to-Left languages for obtaining the MC of web pages. As the English character set and the Right-to-Left character set are...


STAN: Structural Analysis for Web Documents

In this paper we present STAN, a structural analysis tool used for classifying web documents while at the same time extracting meaningful information from them. The extraction and classification rules are defined in terms of a structural grammar operating on both layout properties and content properties of the document. STAN was designed to accept HTML as input and is able to process documents...


Extraction of Core Contents from Web Pages

The information available on web pages consists mostly of semi-structured text documents, represented in XML, HTML, or XHTML format, that lack a formatted document structure. The document does not discriminate between the text and the schema that represents the text. Also, the amount of structure used to represent the text depends on the purpose and size of the text document. No semanti...


Extracting and Modeling the Semantic Information Content of Web Documents to Support Semantic Document Retrieval

Existing HTML markup is used only to indicate the structure and layout of documents, not the document semantics. As a result, web documents are difficult for computer applications to process, retrieve, and explore semantically. Existing information extraction systems are mainly concerned with extracting important keywords or key phrases that represent the content of the documents. The seman...




Publication date: 2009